As consultants, we encounter projects from many different disciplines. Our clients are often fitting linear models. They may have data similar to the ChickWeight data found in the R datasets.
# Look at the first few rows of the data
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
# Look at the structure of the data
str(data.frame(ChickWeight))
## 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
Suppose that the client would like to determine if there is a difference in average chick weights measured on the final measurement day (day 21) between the diet treatments.
# Subset the data to only inlcude the measurements taken on day 21
# (the final measurement time)
ChickWeight_final <- ChickWeight %>% filter(Time == 21)
# Create boxplots of the chick weights for each diet
ggplot(ChickWeight_final, aes(x = Diet, y = weight)) +
geom_boxplot() +
labs(y = "Weight")
The client may come to us with the following analysis and ask if what they have done is correct.
# Fit a linear model
(lm_model <- lm(weight ~ Diet, data = ChickWeight_final))
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight_final)
##
## Coefficients:
## (Intercept) Diet2 Diet3 Diet4
## 177.75 36.95 92.55 60.81
# Look at the anova table with type III sums of squares
# (there are a different number of chicks per treatment)
car::Anova(lm_model, type = "III")
## Anova Table (Type III tests)
##
## Response: weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 505521 1 123.4892 6.069e-14 ***
## Diet 57164 3 4.6547 0.006858 **
## Residuals 167839 41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the lsmeans and pairwise differences
emmeans::emmeans(lm_model, pairwise ~ Diet)
## $emmeans
## Diet emmean SE df lower.CL upper.CL
## 1 178 16.0 41 145 210
## 2 215 20.2 41 174 256
## 3 270 20.2 41 229 311
## 4 239 21.3 41 195 282
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## 1 - 2 -37.0 25.8 41 -1.433 0.4868
## 1 - 3 -92.5 25.8 41 -3.588 0.0047
## 1 - 4 -60.8 26.7 41 -2.281 0.1193
## 2 - 3 -55.6 28.6 41 -1.943 0.2264
## 2 - 4 -23.9 29.4 41 -0.811 0.8487
## 3 - 4 31.7 29.4 41 1.080 0.7036
##
## P value adjustment: tukey method for comparing a family of 4 estimates
What has been done seems reasonable, but the client has forgotten to look at residual plots to check the model assumptions. This is a common oversight.
# Create a residual plot
plot(resid(lm_model) ~ fitted(lm_model))
abline(h = 0)
# Create a normal quantile plot
qqnorm(resid(lm_model))
qqline(resid(lm_model))
plot function# Create residual diagnostic plots
plot(lm_model)
# Create a dataset with the residuals and fitted values
ggdata <- data.frame(resid = resid(lm_model),
fitted = fitted(lm_model))
head(ggdata)
## resid fitted
## 1 27.25 177.75
## 2 37.25 177.75
## 3 24.25 177.75
## 4 -20.75 177.75
## 5 45.25 177.75
## 6 -20.75 177.75
# Create a residual plot
ggplot(ggdata, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Residuals")
# Create a normal quantile plot
library(qqplotr)
ggplot(ggdata, mapping = aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
Many of the clients we work with use SAS. We both like the residualpanel plot option in proc mixed, because it provides a panel of residual diagnostic plots.
The following SAS code fits the same linear model in SAS and outputs the residual panel of plots shown below.
proc import out = ChickWeight_final
datafile = '/folders/myfolders/ggResidpanel example/ChickWeight_final.csv'
dbms = csv
replace;
getnames = yes;
datarow = 2;
run;
proc mixed data = ChickWeight_final plots = residualpanel;
class Diet;
model weight = Diet;
run;
We were inspired by the SAS residual panel to create an R package that provided an easy way to view multiple diagnostic plots created using ggplot2 at the same time. Additionally, we wanted the package to…
This led to the creation of ggResidpanel.
library(ggResidpanel)
resid_panel(lm_model)
Follow these instructions to install ggResidpanel.
# (1) Install the package devtools (if not already installed)
install.packages("devtools")
# (2) Install ggResidpanel from the GitHub repository
devtools::install_github("goodekat/ggResidpanel")
# (3) Load the package
library(ggResidpanel)
resid_panelresid_interactThe function resid_interact creates interactive versions of the residual diagnostic plots. resid_interact makes use of plotly to show additional information about a point when the mouse is hovered over it. The tool tip shows:
All formatting options that were available with resid_panel are available with resid_interact.
For now, only one plot can be shown at a time. The available plots are as follows:
boxplot: A boxplot of residualscookd: A plot of Cook’s D values versus observation numbershist: A histogram of residualsindex: A plot of residuals versus observation numbersls: A location scale plot of the residualsqq: A normal quantile plot of residualslev: A plot of leverage values versus residualsresid: A plot of residuals versus predicted valuesyvp: A plot of observed response values versus predicted valuesBelow are some interactive versions of residual plots created using resid_interact.
resid_interact(lm_model, type = "standardized", title.opt = FALSE)
resid_interact(lm_model, plot = "cookd", theme = "grey")
resid_auxpanelSince the function resid_panel only works with certain model types, we wanted to include an additional function (or an auxiliary function) that can be used with any model. This led to the creation of resid_auxpanel.
Similar to resid_panel, resid_auxpanel creates a panel of residual diagnostic plots, but instead of inputting a model in the function, the residuals and fitted values are input. This causes some plot options to no longer be available. However, all formatting options are still available with resid_auxpanel.
plots:all: This creates a panel of all plot types included in the package that are available for resid_auxpaneldefault: This creates a panel of a residual plot, a normal quantile plot of the residuals, an index plot of the residuals, and a histogram of the residuals. (This is the default option.)SAS: This creates a panel of a residual plot, a normal quantile plot of the residuals, a histogram of the residuals, and a boxplot of the residuals.boxplot: A boxplot of residualshist: A histogram of residualsqq: A normal quantile plot of residualsresid: A plot of residuals versus predicted valuesLet us fit a random forest model to the mtcars data to predict the mpg based on all other variables. We would like to look at the residuals to see if there is any pattern in the residuals in relationship to the fitted values. A histogram of the residuals is also requested.
# Fit a random forest model to the mtcars data to predict the mpg
rf_model <- randomForest::randomForest(x = mtcars[,2:11], y = mtcars[,1])
# Obtain the predictions from the model on the observed data
rf_pred <- predict(rf_model, mtcars[,2:11])
# Obtain the residuals from the model
rf_resid <- mtcars[,1] - rf_pred
# Create a panel with the residual and index plot
resid_auxpanel(rf_resid, rf_pred, plots = c("resid", "index"), theme = "classic")